English-Turkish Parallel Treebank with Morphological Annotations and its Use in Tree-based SMT
Abstract
In this paper, we report our study of tree-based statistical translation from English to Turkish. We describe our data generation process and report initial results for tree-based translation under a simple model. For corpus construction, we used the Penn Treebank on the English side. We manually translated about 5K trees from English to Turkish under grammar constraints, with adaptations to accommodate the agglutinative nature of Turkish morphology. We used a permutation model for subtrees together with a word-to-word mapping. We report BLEU scores under simple choices of inference algorithms.
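As a rough illustration of what a subtree permutation model combined with a word-to-word mapping can look like, the sketch below reorders the children of each node according to a lookup table and then maps each leaf word through a lexicon. The abstract does not specify the model, so everything here is an assumption made for illustration: the tuple-based tree encoding, the names `translate`, `toy_permutations`, and `toy_lexicon`, and the toy entries are hypothetical, and the tables are hand-written rather than learned from data.

```python
from typing import Dict, List, Tuple

# Assumed encoding: a leaf is (tag, word); an internal node is (label, [children]).
PermutationTable = Dict[Tuple[str, Tuple[str, ...]], Tuple[int, ...]]


def is_leaf(tree) -> bool:
    """A node is a leaf when its second element is a word (a string)."""
    return isinstance(tree[1], str)


def translate(tree, permutations: PermutationTable, lexicon: Dict[str, str]) -> List[str]:
    """Reorder each node's children, then map every leaf word through the lexicon."""
    if is_leaf(tree):
        _tag, word = tree
        # Words missing from the lexicon are copied through unchanged.
        return [lexicon.get(word, word)]

    label, children = tree
    child_labels = tuple(child[0] for child in children)
    # Fall back to the original order when no permutation is stored.
    identity = tuple(range(len(children)))
    order = permutations.get((label, child_labels), identity)

    output: List[str] = []
    for index in order:
        output.extend(translate(children[index], permutations, lexicon))
    return output


# Hypothetical toy data: move the verb after its object (verb-final order).
toy_permutations: PermutationTable = {("VP", ("VBZ", "NP")): (1, 0)}
toy_lexicon = {"reads": "okur", "books": "kitap"}

example_tree = (
    "S",
    [
        ("NP", [("NNP", "Ali")]),
        ("VP", [("VBZ", "reads"), ("NP", [("NNS", "books")])]),
    ],
)

print(" ".join(translate(example_tree, toy_permutations, toy_lexicon)))
```

In a statistical setting, the single stored permutation per node would presumably be replaced by a probability distribution over permutations, and the lexicon by word translation probabilities. The deterministic lookup above is only the simplest possible stand-in.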
Similar resources
Building A Case-based Semantic English-Chinese Parallel Treebank
We construct a case-based English-to-Chinese semantic constituent parallel Treebank for a Statistical Machine Translation (SMT) task by labelling each node of the Deep Syntactic Tree (DST) with our refined semantic cases. Since subtree span-crossing is harmful in tree-based SMT, DST is adopted to alleviate this problem. At the same time, we tailor an existing case set to represent bili...
Constructing a Turkish-English Parallel TreeBank
In this paper, we report our preliminary efforts in building an English-Turkish parallel treebank corpus for statistical machine translation. In the corpus, we manually generated parallel trees for about 5,000 sentences from Penn Treebank. English sentences in our set have a maximum of 15 tokens, including punctuation. We constrained the translated trees to the reordering of the children and th...
The English-Swedish-Turkish Parallel Treebank
We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenizat...
Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation
Languages that have no explicit word delimiters often have to be segmented for statistical machine translation (SMT). This is commonly performed by automated segmenters trained on manually annotated corpora. However, the word segmentation (WS) schemes of these annotated corpora are handcrafted for general usage, and may not be suitable for SMT. An analysis was performed to test this hypothesis ...
Turkish Treebank as a Gold Standard for Morphological Disambiguation and Its Influence on Parsing
So far predicted scenarios for Turkish dependency parsing have used a morphological disambiguator that is trained on the data distributed with the tool (Sak et al., 2008). Although models trained on this data have high accuracy scores on the test and development data of the same set, the accuracy drastically drops when the model is used in the preprocessing of Turkish Treebank parsing experiment...